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An improved control structure for 
music synthesis is provided in which: 1) 
the sound generation provided to the adap- 
tive function mapper (501) allows for a 
greatly increased degree of control over 
the sound produced; and 2) training of 
the adaptive function mapper (501) is per- 
formed using an error measure, or error 
norm, that greatly facilitates learning while 
ensuring perceptual identity of the pro- 
duced sound with the training example. In 
one embodiment, sound data is produced 
by applying to an adaptive function map- 
per (501) control parameters including: at 
least one parameter selected from the set 
of time and timbre space coordinates; and 
one parameter from the set of pitch, delta 
pitch, articulation and dynamic. Mapping 
is performed from the control parameters 
to synthesis parameters to be applied to 
a sound synthesizer. In another embodi- 
ment, an adaptive function mapper (501) 
is trained to produce, in accordance with 
the information stored in a mapping store 
(503), synthesis parameters to be applied 
to a sound synthesizer, by steps including: 
analyzing s unds to produce sound parameters describing the sounds; further analyzing the sound parameters t produce c ncrol parameters; 
applying the c ntrol parameters to the adaptive function mapper (501), the adaptive ftincti n mapper (501) in response producing trial 
synthesis parameters comparable to the sound parameters; deriving from the sound parameters and the trail synthesis parameters an error 
measure in acc rdance with a perceptual error norm in which they are weighted in approximate degree to which they are perceived by the 
human ear, and adapting the information in the mapping store (503) with the error measure. 
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CONTROL STRUCTURE FOR 
SOUND SYNTHESIS 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

5 The present invention relates to control structures for computer- 

controlled sound synthesis. 

2. State of the Art 

The application of computers to sound synthesis has been studied and 
practiced for many years. Whereas the computer synthesis of simple sounds is 
10 straightforward, the problem of synthesizing complex, realistic sounds such as 
the human voice, the sound of a piano chord being played, a bird call, etc., has 
posed a continuing challenge. 

One well-known technique of synthesizing complex sounds is that of 
additive synthesis. In conventional additive synthesis, a collection of sinusoidal 
15 partials is added together to produce a complex sound. To produce a complex, 
realistic sound may require as many as 1000 sinusoidal partials to be added 
together. Each sinusoidal partial must be specified by at least frequency and 
amplitude, and possibly phase. Clearly, the computational challenge posed in 
producing complex, realistic sounds by additive synthesis is considerable. 
20 Furthermore, the greatest benefit is obtained when additive synthesis is used 

to produce complex, realistic sounds in real time. That is, the synthesis system 
should be able to accept a series of records each specifying the parameters for a 
large number of partials and to produce from those records a complex, 
interesting, realistic sound without any user-perceptible delay. 
25 Two approaches to additive synthesis have been followed. In the first 

approach (the time-domain, or wavetable, approach), the equivalent of a bank 
of oscillators has been used to directly generate sinusoidal partials. The 
frequency and amplitude values of all of the partials have been applied to the 
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oscillators in the oscillator bank, and the resulting partial* have been added 
together to produce the final sound. The requirement of directly computing each 
partial individually has limited the number of partials that may be included in a 
sound so as to allow the sound to be produced in a reasonable period of time. 

In the second approach (the frequency-domain approach), partials have been 
specified and added in the frequency domain to produce a spectrum, or 
frequency-domain representation, of the final sound. The inverse Fourier 
transform is then used to compute the time-domain representation of the final 
sound, from which the sound is then produced. 

An IFFT additive synthesis technique is described in U.S. Patent 
5,401,897, incorporated herein by reference. In the described additive sound 
synthesis process, sample: blocks are determined by carrying out the inverse 
Fourier transform of successive frequency spectra. The sample blocks are time- 
superimposed and added to form a sequence of samples representing a sound 
wave. The latter procedure is known as overlap-add. 

Other patents relating to additive sound synthesis; include the following: 
U.S. Patent No. 4,856,068; U.S. Patent No; 4,885,790; U.S. Patent No. 
4,937,873; U.S. Patent No. 5,029,509; U.S. Patent No. 5,054,072; and U.S. 
Patent No. 5,327,518; all of which are incorporated herein by reference. 

Prior art additive synthesis methods of the type described, however, have 
remained limited in several respects. Many of these limitations are addressed 
and overcome in copending U.S. Patent Application Serial No. 08/551,889 
(Attorney's Docket No. 028726-008), entitled Inverse Transform Narrow 
Band/Broad Band Additive Synthesis, filed on even date herewith and 
incorporated herein by reference. Not addressed in the foregoing patent 
application is the problem of constructing a suitable control structure that may 
be used to control additive sound synthesis in real time. Prior art methods have 
typically been limited to generating and playing sound described by pre-stored, 
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analyzed parameters rather than values that change in real time during 
synthesis. 

As recognized by the present inventors, the problem of constructing a 
suitable control structure that may be used to control additive sound synthesis in 
5 real time involves two sub-problems. One problem is to provide a user interface 
that may be readily understood and that requires only a minimum of control 
input signals. In other words, the user interface must offer simplicity to the 
user. Another problem is to translate this simplicity seen by the user into the 
complexity often required by the synthesizer and to do so in a time-efficient and 
10 hardware-efficient manner. 

An important contribution to the user interface problem is found in Wessel, 
Timbre Space as a Musical Control Structure, Computer Music Journal 3(2):45- 
52, 1979, incorporated herein by reference. A fundamental musical property is 
that of timbre, i.e., the tone and quality of sound produced by a particular 
15 instrument. For example, a violin and a saxophone each have distinctively 

different timbres that are readily recognizable. The foregoing paper describes 
how to construct a perceptually uniform timbre space. 

A timbre space is a geometric representation wherein particular sounds with 
certain qualities or timbres are represented as points. The timbre space is said 
20 to be perceptually uniform if sounds of similar timbre or quality are proximate 
in the space and sounds with marked difference in timbre or quality are distant. 
In such a perceptually uniform timbre space, perceptual similarity of timbres is 
inversely related to distance. 

The basic idea is that by specifying coordinates in a particular timbre space, 
25 one is able to hear the timbre represented by those coordinates (e.g., a violin). 
If these coordinates should fall between existing tones in the space (e.g., in 
between a violin and a saxophone), an interpolated timbre results that relates to 
the other sounds in a manner consistent with the structure of the space. Smooth, 
finely graded timbral transitions can thus be formed, with the distance moved 
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15 



50 



within the timbre space bearing a uniform relationship to the audible change in 
timbre. 



Also discussed in the paper is the need to reduce the considerable quantity 
of data required by a general synthesis techniques such as additive synthesis 
without sacrificing richness in the sonic result. The approach suggested is the 
use of straight-line-segment approximations to approximate curvilinear envelope 
functions. 

More recently, advances in machine learning techniques such as neural 
networks have been applied to the second sub-problem, that is translating the 
simplicity seen by the user into the complexity often required by the synthesizer 
and to do so in a time-efficient and hardware-efficient manner. Neural networks 
may be considered to be representative of a broader class of adaptive function 
mappers that map musical control parameters to the parameters of a synthesis 
algorithm. The synthesis algorithm typically has a large number of input 
parameters. The user interface, also referred to as the gestural interface 
typically supplies fewer parameters. The adaptive function mapper is therefore 
required to map from a low dimensional space to a high dimensional space. 

The use of a neural network in an electronic musical instrument is 
described in U.S. Patent 5,138;924, incorporated herein by reference. Referring 
to Figure 1, in accordance with the foregoing patent, a neural network 134 is 
used to translate user inputs from a wind controller 135 to outputs used by a 
synthesizer 137 of an electronic musical instrument. The synthesizer 137 is 
shown as being an oscillator bank. In operation, the player blows in breath 
from the mouthpiece 140, and controls the key system 141 with the fingers of 
both hands to play the instrument. Each key composing the key system 141 is 
an electronic switch. The ON/OFF signals caused by operation are input to the 
input layer 142 of the neural network 134. The neural network 134 is a 
hierarchical neural network having four layers, namely an input layer 142 a 
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first intermediate layer 143, a second intermediate layer 144, and an output 
layer 145. 

The number of neurons of the output layer 145 is equal to the number of 
oscillators 146 and attenuators 147. Each pair of neurons of the output layer 
145 outputs the frequency control signal of the sine wave to be generated to the 
respective oscillator 146 and an amplitude control signal to the corresponding 
attenuator 147. The sine wave generated by the oscillator is attenuated to the 
specified amplitude value and input to an adding circuit 148. In the adding 
circuit 148 all the sine waves are added together with the resulting synthesis 
signal being input to the D/A converter 149. In the D/A convener 149 the 
synthesis signal is shaped to obtain a smooth envelope and is then output as a 
musical sound, which is amplified by a sound system (not shown). 

In the foregoing arrangement, because additive synthesis is used, it is 
possible to use the results of analysis by FFT as training patterns for the neural 
network. That is, a musical tone of a specific pitch of the musical instrument to 
be learned is FFT-analyzed, and the results of the FFT (to which the ON/OFF 
pattern used to generate the tone corresponds) is input to the neural network as 
a training pattern. This process is performed for the entire range of tones to be 
produced. 

Many of the techniques employed in additive music synthesis have been 
adopted from work in the area of speech analysis and synthesis. Further 
information regarding the application of neural networks and machine learning 
techniques to music synthesis can be found in Rahim, Artificial Neural 
Networks for Speech Analysis/Synthesis, Chapman & Hall, 199?. 

Despite the known use of adaptive function mappers that map musical 
control parameters to the parameters of a synthesis algorithm, there remains a 
need for an improved control structure for music synthesis in which: 1) the 
sound representation provided to the adaptive function mapper allows for a 
greatly increased degree f control over the sound produced; and 2) training of 
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the adaptive Action mapper is perfonned using an error measure, or error 
norm, that greatly facilitates learning while ensuring perceptual identity of the 
Placed sound with the training example. T*e present invention addresses this 



10 



15 



20 



25 



SUMMARY OF THE INVENTION 
Th e present invention, generally speaking, provides for „ improved 

"7 mUSiC Syn,heSiS WWCh: 1} *• ~ Provide. ,„ 
the adaptive function mapper allows for a greatly incmased degree of comrol 
over the sound produced; and 2) training of the adaptive function mapper is 
performed using an error measure, or error norm, that greatly facilitates 
learntng while ensuring perceptual identity of the produced sound with the 
aatnrng example. In accordance with one embodiment of the invention, sound 
data ts produced by applying to an adaptive function mapper control parameters 
mcludmg: at leas, one parameter selected from the set of time and timbre space 
coordinates; and a. fens, one parameter selected from the set of pitch> ipitch 
angulation and dynamic. Using the adaptive function mapper, mapping is ' 
performed from the contro. parameters to synthesis parameters to be applied to 
a sound synthesis. In accordance with another embodiment of the invention 
an adaptive function mapper is trained to produce, in accordance with 
mformation stored in a mapping store, synthesi, parameters to be applied to a 
sound synthesizer, by steps including; analyzing sounds to produce sound 
parameters describing the sounds; tamer anafyzing the sound parameters to 
produce control parameters; applying the control parameters to me adaptive 
function mapper, the adaptive function mapper in response producing trial 
synthesis parameters comparabfe to the sound parameter; deriving from the 
sound parameters and the trial synthesis parameters an error measure in 
accordance with a perceptual enor norm in which a. leas, some error 
contributions are weighted in appro^ degree „, which ^ m 
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by the human ear during synthesis; and adapting the information stored in the 
mapping store in accordance with the error measure. 

BRIEF DESCRIPTION OF THE DRAWING 
The present invention may be further understood from the following 
5 description in conjunction with the appended drawing. In the drawing: 

Figure 1 is a diagram of a conventional electronic musical instrument 
using a neural network; 

Figure 2 is an overall block diagram of an inverse transform additive sound 
synthesis system in which the present invention may be used; 
10 Figure 3A is a graph showing the temporal evolution of partials making up 

a given sound; 

Figure 3B is a diagram of a neural network that may be used as a control 
structure to produce parameters to be used in synthesis of the sound of Figure 
3A; 

15 Figure 3C is a collection of graphs showing the temporal evolution of 

partials making up similar sounds of different timbres within a timbre space; 

Figure 3D is a diagram of a neural network that may be used as a control 
structure to produce parameters to be used in synthesis of the sounds of Figure 
3C; 

20 Figure 4A is a collection of graphs showing the temporal evolution of 

partials making up similar sounds of different percussive timbres within a 
percussive timbre space; 

Figure 4B is a diagram of a neural network that may be used as a control 
structure to produce parameters to be used in synthesis of the sounds of Figure 

25 4A; 

Figure 5 is a block diagram of the control structure of Figure 2; 
Figure 6 is a block diagram of the control structure of Figure 2 as 
configured during training; 
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Figure 7 is a graph of a frequency dependent weighting function used 
during training; 

Figure 8A is a graph of the temporal evolution of two successive notes 
played in a detached style; 

5 Figure 8B is a modified version of the graph of Figure 8A, showing how a 

smooth transition between the two notes may be constructed in order to 
simulate playing of the notes in a more attached style; 

Figure 9A and Figure 9B are graphs of the evolution of the overall 
amplitudes of two sounds, showing how the two sounds may be mapped to a 
10 common time base. 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 
In the following description, a clear separation is observed between sound 
synthesis per se and the distinct problem of producing parameters to be used to 
control sound synthesis so as to obtain production of a desired sound. The 
present control structure produces appropriate parameters for sound synthesis 
which is then assumed to be performed by an appropriate sound synthesizer, 
such as that described in the aforementioned copending U.S. Application Serial 
08/551,889. Preferably, the synthesizer is capable of real-time operation so as 
to respond with nearly imperceptible delay to user inputs, as from a keyboard, 
footpedal, or other input device. Of course, the present invention is broadly 
applicable to sound synthesizers of all types. Hence, the description of the 
sound synthesizer that follows should be regarded as merely exemplary of a 
sound synthesizer with which the present invention may be used. 

Referring now to Figure 2, a control structure 500 is shown in relation to 
such a synthesizer. The control structure 500 provides parameters to various 
blocks of the sound synthesis system, which will be briefly described. The 
architecture of the system is designed so as to realize an extremely versatile 
sound synthesis system suitable for a wide variety of applications. Hence, 
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certain blocks are provided whose functions may be omitted in a simpler sound 
synthesis system. Such blocks appear to the right of the dashed line 13 in 
Figure 2. The function of the remaining blocks in Figure 2 will therefore be 
described first. 

5 In the prior art inverse transform additive sound synthesis system of U.S. 

Patent 5,401,897, and in other conventional additive sound synthesis systems, a 
frequency spectrum is obtained by adding discrete spectral components grouped 
in spectral envelopes. Each spectral envelope corresponds to a sinusoidal 
component or a spectral noise band. Noise bands are statistically independent, 
10 i.e., generated by a mechanism independently defined and unrelated to the 
mechanism by which the sinusoidal components are generated. 

In the inverse transform additive sound synthesis system of Figure 2, on the 
other hand, partials need not be sinusoidal but may assume any of various 
forms of narrow band components. Hence, the terms "spectrum", "spectra" 
15 and "spectral" usually used in describing sound synthesizers apply to the 
synthesizer of Figure 2 only in the broad sense as connoting a sound 
representation in a domain other than the time domain, and do not necessarily 
connote representation in terms of sinusoidal components. Furthermore, broad 
band components, rather than being defined independently of the narrowband 
20 components, may be generated such that the broad-band-component generating 
mechanism is bound up in the narrow-band-component generating mechanism. 
Consequently, the blocks 89 and 87 in Figure 2, although they may be 
considered to bear a superficial correspondence with the prior an mechanisms 
of generating sinusoidal partials and noise bands, respectively, should be 
25 thought of more generally as performing narrow-band synthesis (89) and broad- 
band synthesis (87). The narrow-band synthesis block 89 and the broad-band 
synthesis block 87 are controlled by control signals from the control structure 
500. 
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Narrow-faand components and broad-band components are added together in 
a transform sum-and mix-block 83. The transform sum-and-mix block 83 is 
controUed by control signals from the control structure 500. The transform 
sum-and-mix block 83 allows for selective distribution, or -dosing, - of energy 
in a given partial between separate transform sums. This feature provides the 
capability for polyphonic effects. 

The transform sum-and-mix block also provides signals to the control 
structure 500. Considerable advantage may be obtained by, for example, using 
the spectral representation found in one or more of the transform sums to 
provide a real-time visual display of the spectrum or other properties of a 
signal. Since a transform-domain representation of the signal has already been 
created, only a minimum of additional processing is required to format the data 
for presentation. A transform sum (e.g., constructed spectrum) may be 
displayed, as well as the magnitudes and frequencies of individual partials. 

Furthermore, the spectral representation found in one or more of the 
transform sums may be used as real-time feedback to the control structure 500 
to influence further generation of the same transform sum or the generation of a 
subsequent transform sum. 

A transform domain filtering block 79 receives transform sums from the 
transform sum-and-mix block and is designed to perform various types of 
processing of the transform sums in the transform domain. The transform 
domain filtering block 79 is controlled by control signals from, and provides 
signals to, the control structure 79. The transform domain lends itself to readily 
performing various types of processing that can be performed in the time 
domain or the signal domain only with considerably greater difficulty and 
expense. 

Transform domain processing allows accommodation of known perceptual 
mechanisms, as well as adaptation to constraints imposed by the environment in 
which the synthesized sound is to be heard. By way of example only, transform 
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domain processing may be used to perform automatic gain control or frequency- 
dependent gain control. Similarly, simulations of auditory perception may be 
used to effectively "listen" to the sound representation before it is synthesized 
and then alter the sound representation to remove objectional sounds or 
5 perceptually orthogonalize the control parameter space. 

Following transform domain processing, the sound representation is 
synthesized using a bank of inverse transform/overlap-add operations 73 to 
transform each transform sum. Each inverse transform IT indicated in Figure 2 
bears an approximate correspondence to the conventional inverse Fourier 
10 transform previously described. However, the inverse transform need not be a 
Fourier inverse transform, but may be a Hartley inverse transform or other 
appropriate inverse transform. The number of transforms computed, n.t., is 
limited only by the available computational power. 

Time-sampled signals produced by the inverse transform/overlap-add bank 
15 73 are input to an output matrix mix block 71. The output matrix mix block is 
realized in a conventional manner and is used to produce a number of output 
signals, n.o., which may be the same as or different than the number of 
transforms computed, n.t. The output signals are D-to-A converted and output 
to appropriate sound transducers. 
0 The sound synthesis system described produces sounds from a parametric 

description. To achieve greater flexibility and generality, the blocks to the right 
of the dashed line 13 may be added. These blocks allow stored sounds, real- 
time sounds, or both, to be input to the system. 

Sound signals that are transform coded are stored in a block 85. Under 
5 control of the control structure 500, these signals may be retrieved, transform 
decoded in a transform decode block 81, and added to one or more transform 
sums. The stored signals may represent pre-stored sounds, for example. 

Real-time signals may be input to block 75, where they are forward 
transformed. A block 77 then performs transform filtering of the input signals. 
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^^.tiansfnned signals aro then add* to™ „ ^ 

me control of the control structure 500 * 

22 ti ° n> " - * - rf o™ «* he input to a Meek 

72 that performs analysis and sys*m identification. System identification 

derrving a paramertc representation of the sign*. Results from an 
^ ~» may be fed back to the control structure 500 and used in the 
«» of construct of subsequent specda or the modification of the current 
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° f «« «"«■. 500 of Hgure 2 may be more clearly 

undersrood with reference to Figure 3 A and succeeding figures, m order^T 

rr of a singie ^ ° f a ^ ^ * — *«»• — 

beab e to output the correct amplitudes for each partial within the sound (or at 

l ZTr y*"* M " * *« * »nnd. Lme 

pamals have relatively large amphtitde and other partials have tOativdy smafi 
■mputudes. Partial of different fluencies evo,ve differendy over time T 
"urse, in actual practice, time is measured discretely, such mat the control 
sm-enne outputs the ampfitudes for the partials « ^ dme kamm 
*e course of the sound. A neural network of the general type shown i„ Hgure 
3B may be used to "memoriae" the temporal evolution of me partials for the 
sound and to produce dam describing the sound. In particular, the neural 
network of Figure 3B has a time input unit, a number of hidden units, and a 
number of output units equal to the number of partials in the sound to be 
synthesized. When a particular time increment is specified by mputting a 
corresponding time signal to the time unit, each output unit specifies a 
frequency component's amplitude during that time increment 

To increase its versatility, the control structure of Figure 3B may be 
generalized in order to produce data describing similar sounds in different 
timbres within a timbre space. Referring to Hgure 3C, the sound of Hgure 3A 
» now represented as a single sound within a family of sounds of diffel 
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timbres. The sounds are arranged in a timbre space, a geometrical construct of 
the type previously described. A neural network of the general type shown in 
Figure 3D is provided with additional inputs X and Y in its input layer to allow 
for specification of a point within the timbre space. The neural network may be 
5 used to -memorize" the temporal evolution of the partials for each sound and to 
produce data describing the appropriate sound of a selected timbre in 
accordance with the time input and the application of timbre space coordinates 
to the input nodes. 

The apparent simplicity of providing a time input belies the remarkable 
10 increase (as compared to the prior art) in the power of the control structure that 
results, providing the power to control the synthesis of a broad universe of 
sounds. Single sounds as a result become very elastic, susceptible to being 
stretched or compressed in various ways without altering the quality of the 
sound. Furthermore, the time input allows for differences in the time bases of 
15 different sounds to be accounted for in order to produce sounds by interpolating 
between various other sounds, without producing artifacts. This feature is 
explained in greater detail hereinafter. 

The foregoing description has assumed that the sound to be synthesized is 
harmonic. The same method may be applied, however, to percussive sounds, as 
20 shown in Figure 4A and Figure 4B. Of course, percussive tones may have 

different timbres (e.g., the sound of drum as distinguished from the sound of a 
bell). Figure 4A and Figure 4B therefore show a percussive tone timbre space 
and a neural network having timbre space coordinate inputs, respectively. Note 
that partials rise almost instantaneously to respective peak values at the 
25 beginning of the sound (corresponding to a time when the percussive sound is 
struck) and then decay exponentially in accordance with a particular time 
constant. Each partial may be described throughout its duration in terms of an 
initial amplitude and a time constant. In the neural network of Figure 4B, 
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therefore, the input layer does not have a time input. The output layer produces 

an amplitude and a time constant for each partial. 

Referring now to Figure 5, the control structure 500 of Figure 2 will be 
described in greater detail. The control structure 500 is realized in the form of 
an adaptive function mapper 501. In a preferred embodiment, the adaptive 
function mapper 501 is a neural network. In other embodiment, the adaptive 
function mapper 501 may take the form of a fuzzy logic controller, a memory- 
based controller, or any of a wide variety of machines that exhibit the capability 
of supervised learning. 

Basically, the role of the adaptive function mapper 501 is to map from 
control parameters within a low-dimensional control parameter space to 
synthesis parameters within a high-dimensional synthesis parameter space This 
mapping is performed in accordance with data stored in a mapping store 503 
In particular, the mapping store 503 contains weights applied to various error 
terms during supervised learning and changed in accordance with a supervised 
learning procedure until an acceptable error is achieved. The adaptive function 
mapper 501 will then have been trained and may be used in -production mode- 
rn which different combinations and patterns of control parameters are applied 
to the adaptive function mapper 501 in response to the gestures of a user The 
adaptive function mapper 501 maps from the control parameters to synthesis 
parameters which are input to a spectral sound synthesis process 70 (such as the 
one shown in Figure 2) in order to synthesize a corresponding pattern of 
sounds. 
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In a preferred embodiment, the control parameters include the following: 

Table 1 



CONTROL PARAMETER 


DESCRIPTION 


time (canonical) 


lAn identifier of the temporal progression of a 
sound relative to a sound template having 
different phases such as attack, sustain, release. 


pitch 


rThe "note* to be played. 


a pitch 


lA pitch offset. May be used to sharpen a note or 
flatten a note or varied up and down to achieve a 
kdbrato effect. 


dynamic 


|How loud or how soft the note is to be played 7~ 


articulation 

i 


A description of a desired transition from one 
uui& vkj «*w iic a i in icrms oi i ) ine pitcn ot tne 
previous note; 2) the dynamic of the previous 
note; and 3) the time between the release point of 
the previous note and the attack of the present 
note. 


timbre space coordinates ] 

Is 


[dentify a point in a timbre space, preferably a 
perceptually uniform timbre space. May identify a 
x)int corresponding to a real instrument (oboe, 
rumpet, etc.) or an intermediate point having a 
ynthetic timbre. 



The organization represented by the foregoing control parameters is of quite 
fundamental significance in several respects. First, with respect to the purely 
musical parameters of pitch, Apitch, articulation and dynamic, only pitch and 
dynamic are incorporated into simpler prior art models such as that of Figure 1. 
Implicit in Figure 1 is the musical parameter of instrument, corresponding to a 
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single point in timbre space, or possibly one of multiple points in timbre space 
each corresponding to a real instrument. Without accounting for .pitch and 
articulation, there can be produced only very simple musical expressions of 
detached notes played in one or maybe a few real timbres without vibrato or any 
*milar effect. Furthermore, the manner in which .pitch and articulation might 
be accounted for is not at all apparent from conventional models. 

Second, with respect to the parameters of time and timbre space coordinates 
these parameters are not musical parameters in the traditional sense, in that they 
represent properties that can only be controlled using a digital computer. The time 
parameter represents time in intervals of a few milliseconds, an interval finer than 
the ability of the human ear to perceive, and furthermore represents canonical 
time, thereby providing a common time base between different sounds. Unlike real 
time, which progresses forward at a fixed rate, canonical time may be advanced 
retarded, or frozen. The ability to freeze time allows for a considerable reduction 
to be achieved in the volume of training data required, since synthesis parameters 
corresponding to a single frame of steady-state sample data can be held 
^definitely. The timbre space parameters specify not only real instruments but also 
an infinitude of virtual instruments, all arrayed in such a manner as to be 
intelligently manipulated by a user. 

In a preferred embodiment, the synthesis parameters output by the adaptive 
function mapper 501 are those employed by the spectral sound synthesis process 
70 of Figure 2. That is, the adaptive function mapper 501 outputs an amplitude 
signal for each of a multitude of partials. The adaptive function mapper 501 also 
outputs signals specifying a noise pan of the sound, including signals specifying 
broadband noise and signals specifying narrowband noise. For broadband noise 
the adaptive function mapper 501 outputs a noise amplitude signal for each of a 
number of predetermined noise bands. For narrowband noise, the adaptive function 
mapper 501 outputs three signals for each narrowband noise component: the center 
frequency of the noise, the noise bandwidth, and the noise amplitude. The adaptive 
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function mapper 501 may be configured to output only a single narrowband noise 
component or may be configured to output multiple narrowband noise components. 
The output of the adaptive function mapper 501 may therefore be represented as 
follows: 

*if *2f. a«» Noise part (Broadband) (Narrowband), 
where ^ represents the amplitude of a partial. 

The adaptive function mapper 501 is trained on "live" examples, that is 
sounds captured from playing of a real instrument by a live performer. The 
training data is prepared in a systematic fashion to ensure the most satisfactory 
results. The preparation of the training data will therefore be described prior to 
describing the actual training process (Figure 6). 

An object of training is to populate the timbre space with points 
corresponding to a variety of real instruments. In between these points the 
adaptive function mapper is then able to, in effect, interpolate in order to create 
an almost infinite variety of synthetic timbres. Therefore, recording sessions are 
arranged with performers playing real instruments corresponding to points 
located throughout the timbre space. The instrument may be an oboe, a french 
horn, a violin, etc. The instrument may also be a percussion instrument such as 
a bell or a drum, or even the human voice. During a session, the performer 
wears headphones and is asked to play, sing, or voice scales (or some other 
suitable progression) along with a recording of an electronic keyboard, 
matching the recording in pitch, duration and loudness. The scales traverse 
substantially the entire musical range of the instrument, for example three 
octaves. 

By conducting repeated such sessions with a variety of instruments, live 
samples are obtained corresponding to points throughout most of the control 
parameter space, i.e., the portion of the control parameter space characterized 
by timbre, pitch, loudness and Apitch. Note that the Apitch parameter is 
ignored during the recording session. The Apitch parameter may be ignored 
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during recording because it is a derivative parameter related to the pitch 
parameter, which is accounted for during performance. The .pitch parameter 
must be accounted for after performance and before training. This accounting 
for .pitch is done, in approximate terras by analyzing pitch changes during 
performance and -adding a .pitch track" to the recording describing the pitch 
changes. Explicitly accounting for .pitch makes it possible, for example for a 
performer to use vibrato during a recording session, as experienced performers 
will almost inevitably do, but for that vibrato to be removed if desired during 
synthesis. 

The samples obtained in the manner described thus far are detached 
samples, i.e., samples played in the detached style in which the previous note 
has decayed to zero before the next note is begun. The other chief articulation 
style is legato, or connected. The performer is therefore asked to played 
various note combinations legato, over small note intervals and over large note 
intervals, as well as in the ascending and descending directions. The articulation 
parameter dimension of the control parameter space will typically be sampled 
sparsely because of the vast number of possible combinations. Nevertheless a 
complete set of articulation training examples may be obtained by -cutting and 
pasting- between samples in the following manner. 

Referring to Figure 8A, performance examples may have been obtained for 
two different notes each played in a detached manner. Because the articulation 
parameter dimension of the control parameter space is sampled sparsely, no 
performance example may have been obtained of the same two notes played in 
close succession in a more attached style. Such a performance example may be 
constructed, however, from the performance examples of the two different 
notes each played in a detached manner. Such construction requires that the 
decay segment of the first note be joined to the attack segment of the second 
note in a smooth, realistic-sounding manner. 
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The nature of the transition will depend primarily on the desired articulation 
and on the timbre of the notes. That is, the shape of the transition will depend 
on whether the notes are those of a violin, a trombone, or some other 
instrument By observing analysis results of various articulation examples in 
5 various timbres, appropriate transition models may be derived for constructing 
transition segments using the amplitudes of partials from the decay segment of 
the first note and the amplitudes of partials from the attack segment of the 
second note. A further input to the transition model is the parameter At 
describing-the desired articulation, shown in Figure 8B as the time from the 
3 release point of the first note to the decay point of the second note. 

After a sufficient set of articulation examples is obtained, either by live 
performance, construction as described above, or typically some combination of 
the same, each sound in the resulting library of sounds is then transformed 
using short-term-Fourier-transform-based spectral analysis as described in 
various ones of the previously cited patents. The sounds are thus represented in 
a form suitable for synthesis using the spectral sound synthesis process 70. 
Before training can begin, the sound files must be further processed 1) to add 
Apitch information as previously described; 2) to add segmentation information, 
identifying different phases of the sound in accordance with the sound template; 
and 3) to add time information. These steps may be automated to a greater or 
lesser degree. The third step, adding information concerning canonical time, or 
normalized time, to each of the sounds, is believed to represent a distinct 
advance in the art. 

In order to establish the relationship between real time and the common 
time base called canonical time, a common segmentation must be specified for 
the different tones involved. Segmentation involves identifying and marking 
successive temporal regions of the sounds, and may be performed manually or, 
with more sophisticated tools, automatically. In Figure 9A and Figure 9B, 
sound A and sound B have a common segmentation in that the various 
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2 - 3 »d 4 can be associated with each other. Canonical time is 
calculated by determining the proportion of real time tha, has elapsed in a given 
segment. FoUowing this method, the canonical time at me beguming of a 
segment is 0.0 and at the end 1.0. The canonical dme halfway through the 
segment is 0.5. In this manner, any given point in real dme can be given a 
canonical time by firs, identifying me segment containing the time point and 
then by determining what proportion of the segment has elapsed 

Following post-processing of th. sound files in the marmer described above 
tratmng of the adaptive function mapper 501 may begin. For this purpose all ' 
of the sound files are concatenated into one large owning flle . Tnining ' 
take several hours, a day, or several days depending on the length of the 
training file and the speed of the computer used. 

Referring „ Figure 6, during training, control parameters for each frame of 
tiammg data stored in a store 601 are applied in tun, to the adaptive function 
ntapper 501. At the same time, the corresponding synthesis parameters, also 
stored in the store 601, are applied to a perceptual etror norm block 603 The 
output signals of the adaptive function mapper 50. produced in response to the 
control pammeters are also input «, me perceptual em,r norm block 603 A 
P«ceptual error norm is calculated in accordance with the difference between 
the output signals of the adaptive function mappe, 501 and the corresponding 
^thesis paramemrs. Information within the mapping store is varied in 
accordance with the perceptual error norm. Then a next frame is processed 
Tranung continues until an acceptable error is achieved for every sound frame 
within the training data. 

In an exemplary embodiment, the adaptive function mapper 501 is realized 
as a neural network simulated on a Silicon Graphics Indigo- computer. In one 
«ample. the neural network had seven processing units i„ m mput Uyer ^ 
processing units in an intermediate layer, and eighty output units in an output 
layer, with the nerwork being fully c«m„e«ed. In the same example, me neural 
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network was trained using the well-known back propagation learning algorithm 
Of course, other network topologies and learning algorithms may be equally 01 
more suitable. Furthermore, various other types of learning machines besides 
neural networks may be used to realize the adaptive function mapper 501. 

Note in Figure 6 that the error norm computed by the block 603 is a 
perceptual error norm, i.e., an error norm in which at least some error 
contributions are weighted in approximate degree to which they are perceived 
by the human ear during synthesis. Not all errors are perceived equally by the 
human ear. Hence, training to eliminate errors that are perceived by the human 
ear barely if at all is at best wasted effort and at worst may adversely affect 
performance of the adaptive function mapper 501 in other respects. By the same 
token, training to eliminate errors that are readily perceived by the human ear 
is essential and must be performed efficiently and well. 

In a preferred embodiment, the perceptual error norm computed by the 
block 603 mimics human auditory perception in two different ways. First, 
errors are weighted more heavily during periods of considerable change and are 
weighted less heavily during periods of little change. Second, errors are 
weighted more heavily at high frequencies than at lower frequencies, in 
recognition of the fact that the human ear perceives errors in the high frequency 
range more acutely. The former is referred to as temporal envelope error 
weighting and the latter is referred to as frequency dependent error weighting. 
With regard to frequency dependent error weighting, in one experiment, for 
example, partials were successively added within a set frequency interval to 
form a resulting succession of sounds, each more nearly indistinguishable from 
the previous sound, first in a low frequency range and then in a high frequency 
range. In the low frequency range, after only a few partials, the successive 
sounds became indistinguishable. In the high frequency range, several tens of 
partials were added before the successive sounds became indistinguishable, 
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demonstrating that the ear is very sensitive to fine structure in the high 
frequency range. 

More particularly, in a preferred embodiment, the error with respect to 
each output signal of the adaptive function mapper 501 is calculated in 
accordance with the following equation: 



where a i is the desired synthesis parameter, is the corresponding output 
sxgnal of the adaptive function mapper 501, RMS is the error envelope and f 
and g represent monotonic increasing functions. The exact form of the functions 
f and g is not critical. The graph of an example of a function f that has been 
found to yield good results is shown in Figure 7. 

It will be appreciated by those of ordinary skill in the art that the 
mvention can be embodied in other specific forms without departing from the 

r OT — ^ CharaCter ™ C PrcSCnt * discI °^ embodiments are 

therefore considered in all respects to be illustrative and not restrictive The 
scope of the invention is indicated by the appended claims rather than the 
foregoing description, and all changes which come within the meaning and 
range of equivalents thereof are intended to be embraced therein 
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What is claimed is: 

1. A method of producing sound data, comprising the steps of: 
applying to an adaptive function mapper control parameters 
including: 

5 a time parameter; and 

at least one parameter selected from the set of timbre 
space coordinates, pitch, Apitch, articulation and dynamic; and 
using the adaptive function mapper, mapping from the control 
parameters to synthesis parameters to be applied to a sound synthesizer. 

10 2 - The method of Claim 1 , wherein the adaptive function mapper 

effects a several-fold multiplication in the number of parameters from the 
control parameters to the synthesis parameters. 

3. The method of Claim 1, wherein the time parameter is canonical 
time, derived by warping real time with respect to a plurality of sound samples 
to a common time base. 

4. The method of Claim 1, wherein the timbre space coordinates are 
coordinates within a perceptually uniform timbre space. 

5. The method of Claim 1, wherein the adaptive function mapper is 
a neural network. 



6. The method of Claim 1, wherein the adaptive function mapper is 
a simulated neural network. 



7. The method of Claim 1, wherein the synthesis parameters are to 
be applied to an additive sound synthesizer and include amplitudes of partials. 
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8. The method of Claim 7, wherein the synthesis parameters are to 
be applied to an inverse FFT additive sound synthesizer and include, in addition 
to the amplitudes of partials, noise parameters. 

9. The method of Claim 8, wherein the noise parameters include 
5 broadband noise parameters and narrowband noise parameters. 

10. A method of training an adaptive function mapper to produce in 
accordance with information stored in a mapping store, synthesis parameters' to 
be applied to a sound synthesizer, the method comprising the steps of: 

analyzing sounds to produce sound parameters describing the 

5 sounds; 

further analyzing the sound parameters to produce control 
parameters; 

applying the control parameters to the adaptive function mapper, 
the adaptive function mapper in response producing trial synthesis 
parameters comparable to the sound parameters; 

deriving from the sound parameters and the trial synthesis 
parameters an error measure in accordance with a perceptual error norm 
in which at least some error contributions are weighted in approximate 
degree to which they are perceived by the human ear during synthesis; 
and 

adapting the information stored in the mapping store in 
accordance with the error measure. 

11. The method of Claim 10, wherein the error contributions are 
weighted in accordance with a monotonically increasing, frequency-dependent 
weighting function. 
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12. The method of Claim 11, wherein the error contributions are 
further weighted in accordance with a monotonically increasing function of a 
time derivative of the error measure. 

13. The method of Claim 10, wherein the control parameters include: 
5 at least one parameter selected from the set of time and 

timbre space coordinates; and 

at least one parameter selected from the set of pitch, 
Apitch, articulation and dynamic. 

14. The method of Claim 13, wherein the time parameter is 

10 canonical time, derived by warping real time with respect to a plurality of 
sound samples to a common time base. 

15. The method of Claim 13, wherein the timbre space coordinates 
are coordinates within a perceptually uniform timbre space. 

16. The method of Claim 10, wherein the adaptive function mapper 
15 effects a several-fold multiplication in the number of parameters from the 

control parameters to the synthesis parameters. 

17. The method of Claim 10, wherein the adaptive function mapper 
is a neural network. 



18. The method of Claim 10, wherein the adaptive function mapper 
20 is a simulated neural network. 



19. The method of Claim 10, wherein the synthesis parameters are to 
be applied to an additive sound synthesizer and include amplitudes of partials. 
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20. The method of Claim 19, wherein the synthesis parameters are to 
be appbed to an inverse FFT additive sound synthesizer and include, in addition 
to the amplitudes of partials, noise parameters. 

21. The method of Claim 20 whemin th» 

u ' wnercm me noise parameters include 

broadband noise parameters and narrowband noise parameters. 

22. An apparatus for producing sound data, comprising: 
an adaptive function mapper; 

means for applying to the adaptive function mapper control 
parameters including: 

at least one parameter selected from the set of time and 
timbre space coordinates; and 

at least one parameter selected from the set of pitch, 
Apitch, articulation and dynamic; and 

the adaptive function mapper comprising means for 
mapping from the control parameters to synthesis parameters to . 
be applied to a sound synthesizer. 

23. An apparatus for producing synthesis parameters to be applied to 
a sound synthesizer, the apparatus comprising: 
an adaptive function mapper; 

a mapping store coupled to the adaptive function mapper 
means for analyzing sounds to produce sound parameters 
describing the sounds; 

means for further analyzing the sound parameters to produce 
control parameters; 
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means for applying the control parameters to the adaptive 
function mapper, the adaptive function mapper in response producing 
trial synthesis parameters comparable to the sound parameters; 

means for deriving from the sound parameters and the trial 
synthesis parameters an error measure in accordance with a perceptual 
error norm in which at least some error contributions are weighted in 
approximate degree to which they are perceived by the human ear 
during synthesis; and 

means for adapting the information stored in the mapping store in 
accordance with the error measure. 

24. A method of producing sound data, comprising the steps of: 
applying to an adaptive function mapper a trigger input signal; 

and 

using the adaptive function mapper, producing synthesis 
parameters to be applied to a sound synthesizer, said synthesis 
parameters including a time constant. 

25. The method of Claim 24, wherein the adaptive function mapper 
effects a several-fold multiplication in the number of parameters from the 
control parameters to the synthesis parameters. 

26. The method of Claim 24, wherein the timbre space coordinates 
are coordinates within a perceptually uniform timbre space. 



27. The method of Claim 24, wherein the adaptive function mapper 
is a neural network. 
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28. The method of Claim 24, wherein the adaptive function mapper 
is a simulated neural network. 



29. The method of Claim 24, wherein the synthesis parameters are to 
be applied to an additive sound synthesizer and include amplitudes of partials. 

30. The method of Claim 24, wherein amplitudes of partials and time 
constants are produced in one-to-one correspondence. 
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